Focused efforts by several international laboratories
have resulted in the sequencing of the genome of the
causative agent of severe acute respiratory syndrome
(SARS), the novel coronavirus SARS-CoV, in record time.
Using cumulative skew diagrams, I found that mutational
patterns in the SARS-CoV genome were strikingly
different from those in other coronaviruses in terms of
mutation rates, although they were in general agreement
with the model of the coronavirus lifecycle. These
findings might be relevant for the development of
sequence-based diagnostics and the design of agents
to treat SARS.


Previously, cumulative skew diagrams have been 
employed successfully to analyze mutational patterns in 
various viral genomes. They have been used to: (i) link the 
nucleotide content changes to the genome organization, 
replication and transcription of double-stranded DNA 
viruses [1]; (ii) correlate the transcriptional pattern of
bacteriophage T7 with its nucleotide content [2]; and
(iii) associate the compositional biases with mutational 
pressures in retroviruses [3]. (See Box 1 on how to 
interpret cumulative diagrams.) 

The severe acute respiratory syndrome coronavirus 
(SARS-CoV) plus-strand genomic RNA (plus-gRNA) consists
of two distinct parts: one (comprising two thirds of the
genome) encodes the replicase polyprotein and the other 
encodes structural and other proteins [4,5]. In this paper, 
these parts are referred to as the long and short arm, 
respectively. Strikingly, there is a change in behavior of the 
cumulative skew diagram at the border of the arms in all 
coronaviruses sequenced to date (six representatives are 
shown in Figure 1), indicating a lower GC skew on the 
short arm. This behavior suggests that biological processes 
that distinguish the two arms (Box 2) are responsible for 
the mutational pattern, rather than the fidelity of the
replication machinery; the latter would result in a
constant slope of cumulative skew, as is the case in
retroviruses [3]. The mutation rates (as indicated by the
extent of the cumulative skew on the y-axis) do not appear
to depend on the host organism: skews are similar in murine,
avian and human 229E coronaviruses (Figure 1c,e,f) but
substantially lower in SARS-CoV (Figure 1a, Table 1).
The skew diagrams support the current model of 
coronavirus replication and transcription (Box 2), and 
GC skew is particularly illustrative in this regard because 
in both of these processes one RNA strand is single 
stranded. Deamination of cytosine to uracil is > 100 times 
faster in single-stranded DNA compared with double- 
stranded DNA [6], and this ratio is probably similar in 


Table 1. Mean excess of guanines versus cytosines in
coronavirus genomes

Virus genome^a    Extra guanines compared with cytosines
                  per 100 bp of genomic sequence^b

                  L^c      S^c      L-S^c

SARS-CoV          1.8      -1.7     3.5
BCoV              7.8       3.5     4.3
MHV               7.1       3.5     3.6
PEDV              4.4       1.4     3.0
HCoV              6.0       1.8     4.2
IBV               5.9       4.2     1.7

^a Abbreviations: BCoV, enteric bovine coronavirus; IBV, avian infectious bronchitis
virus; HCoV, human coronavirus (229E); MHV, murine hepatitis virus; PEDV, porcine
epidemic diarrhea virus; SARS-CoV, severe acute respiratory syndrome coronavirus.

^b These averages represent the trends depicted in Figure 1 but without taking into
account G+C content (which ranges from 37% to 42% in Coronaviridae). GC
content does not affect the trends observed in Figure 1.

^c The change in number of guanines compared with cytosines is probably due to
cytosine deamination in the minus strand on the short arm and reflects additional
mutational pressure on that arm. Notably, this change is comparable between
SARS-CoV and other coronaviruses, whereas the guanine excess on the long arm is
much smaller in SARS-CoV. Definitions: L, long arm; S, short arm; L-S, change on
short arm.
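
The per-arm averages in Table 1 can be illustrated with a minimal sketch. It assumes that "extra guanines per 100 bp" means 100 x (G - C) / length for each arm, and that the arm boundary is supplied as a 0-based index; the function names and the toy sequence are hypothetical and are not taken from the original analysis.

```python
def guanine_excess_per_100(seq: str) -> float:
    """Return the excess of G over C per 100 nucleotides of seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") - seq.count("C")) / len(seq)


def arm_excess(genome: str, boundary: int) -> dict:
    """Split the genome at `boundary` (assumed end of the replicase gene)
    and report the guanine excess on the long arm (L), the short arm (S)
    and their difference (L-S)."""
    long_arm, short_arm = genome[:boundary], genome[boundary:]
    l_value = guanine_excess_per_100(long_arm)
    s_value = guanine_excess_per_100(short_arm)
    return {"L": l_value, "S": s_value, "L-S": l_value - s_value}


# Toy genome (not real data): a G-rich 200-nt "long arm" followed by
# a C-rich 100-nt "short arm".
toy_genome = "GGGCA" * 40 + "GCCAU" * 20
result = arm_excess(toy_genome, boundary=200)
print(result)
```

With a real genome, the boundary would be the coordinate marked by the vertical bar in Figure 1.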
Box 1. Interpreting cumulative skew diagrams
Cumulative skew diagrams [1,7,8] can simplify the interpretation of
biases in nucleotide sequence. An example of such bias is GC skew,
which is a measure of the relative excess of guanines against cytosines
on one sequence strand. It is calculated as ([G] - [C])/([G] + [C]), where
[G] and [C] represent the number of occurrences of guanines and cytosines
within a specified sequence window.
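
The definition above can be sketched in a few lines of Python. This is only an illustration: the choice of non-overlapping windows, the default window size of 60 (borrowed from the Figure 1 legend) and the convention of returning 0 for a window without G or C are assumptions, and the function names are hypothetical.

```python
def gc_skew(window: str) -> float:
    """GC skew of one window: ([G] - [C]) / ([G] + [C])."""
    w = window.upper()
    g, c = w.count("G"), w.count("C")
    if g + c == 0:
        return 0.0  # assumption: a window with no G or C has zero skew
    return (g - c) / (g + c)


def windowed_gc_skew(seq: str, window_size: int = 60) -> list:
    """GC skew for consecutive, non-overlapping windows of the sequence.
    A trailing partial window is ignored."""
    last_start = len(seq) - window_size
    return [gc_skew(seq[i:i + window_size])
            for i in range(0, last_start + 1, window_size)]


# Toy sequence with three guanines for every cytosine.
skews = windowed_gc_skew("GGGC" * 30, window_size=60)
print(skews)
```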

Such biases have been reported for bacteria [7,20,21] and double-
stranded (ds) DNA viruses [1,21], and interpreted as evidence of
asymmetry in mutation pressure because the skew changes polarity at
the replication origin. The GC skew has been linked to the time the DNA
strand spends in a single-stranded state [7] (for example, during
replication or transcription) because cytosine deamination is much
faster in single-stranded (ss) DNA than in dsDNA (see [22,23] for
in-depth reviews of the underlying mechanisms).

Cumulative skew represents a numerical integration of the skew value 
across the genome and replaces the most significant changes in polarity 
by global maxima and minima. For example, a non-cumulative plot of
GC skew is shown in Figure Ia for the genome of the virus SV40, where
GC skew changes sign at a point near the 50% coordinate. It is unclear
which of the multiple local polarity switches in the middle of the plot is
actually the global switch. On the cumulative GC skew plot (Figure Ib),
these polarity switches are seen to correspond to local minima and
maxima on the GC diagram. The global maximum at 54% clearly
separates two genome segments with the opposite deviations from the 
parity [G] = [C], and the slopes of the opposite linear trends on the GC 
diagram correspond to the respective mean GC skews for the two 
genome segments. GC skew is positive for the leading (left-hand side of 
the GC diagram) and negative for the lagging strand, as is the case with 
microbial genomes. 
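
As a rough illustration of how a cumulative diagram turns a polarity switch into a global maximum, the sketch below sums per-window GC skews along a toy sequence and reports where the running total peaks. The non-overlapping windows, the toy sequence and the function names are assumptions made for this example only.

```python
from itertools import accumulate


def gc_skew(window: str) -> float:
    """GC skew of one window: ([G] - [C]) / ([G] + [C])."""
    g, c = window.upper().count("G"), window.upper().count("C")
    return (g - c) / (g + c) if g + c else 0.0


def cumulative_gc_skew(seq: str, window_size: int = 60) -> list:
    """Running sum of per-window GC skew (a discrete integration)."""
    starts = range(0, len(seq) - window_size + 1, window_size)
    return list(accumulate(gc_skew(seq[i:i + window_size]) for i in starts))


def global_maximum_percent(cumulative: list, window_size: int, seq_len: int) -> float:
    """Genome coordinate (in %) at the end of the window where the
    cumulative skew reaches its global maximum."""
    idx = max(range(len(cumulative)), key=cumulative.__getitem__)
    return 100.0 * (idx + 1) * window_size / seq_len


# Toy sequence: a G-rich first half followed by a C-rich second half.
toy = "GGGCAT" * 50 + "GCCCAT" * 50
cumulative = cumulative_gc_skew(toy, window_size=60)
peak = global_maximum_percent(cumulative, 60, len(toy))
print(cumulative, peak)
```

The slope on each side of the peak equals the mean per-window skew of that segment, which is the property used to read the diagrams.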

The two segments of the GC diagram also correspond to the 
divergently transcribed coding sequences of SV40. Note that the slopes 
of the two halves of the GC diagram are different. The excess of G
compared with C in the leading strand in the late mRNA region of SV40 is
almost half of the excess of C compared with G in the lagging strand in
the early mRNA region. This suggests a contribution of transcription to
the overall picture.

An even more illustrative interplay of replication and transcription is
seen in a cumulative diagram of human papillomavirus [1] (Figure Ic).
Although the replication is bi-directional (from 0 or 100% on the 
diagram), transcription is unidirectional: all papillomavirus genes are 
transcribed from one strand. If there are separate biases induced by 
replication and transcription, they should act in the same direction in 
one half of a papillomavirus genome, and in the opposite directions in 
the other half. This model explains the observed behavior in Figure Ic 
such that the steeper slopes on the left-hand side reflect a sum of the net 
contributions of replication and transcription, and the right-hand side of 
the diagrams corresponds to their subtraction, where their effects 
almost cancel each other out (a near-horizontal cumulative plot 
corresponding to zero mean GC skew). 

The same rules apply to the analysis of RNA viral genomes. For 
example, for plus-strand RNA viruses the events taking place on the 
minus strand can be taken into account in much the same way as is done 
for the second strand of dsDNA. Because GC skew measures the level of 
cytosine depletion on one strand relative to its complementary strand, 
changes in the diagram shape enable researchers to infer the
contribution of processes occurring on both strands, even in taxonomic
orders of single-stranded viruses.


RNA. Thus, cumulative GC skew can be interpreted as a 
measure of cytosine depletion on one strand relative to its 
complementary strand. 

For most of the coronaviruses, there is an almost
constant excess of G compared with C throughout the
long arm (Figure 1), indicating an elevated C to U 
deamination in the plus strand. Similar to skews 
observed in DNA genomes [1,7,8], this probably results 
from the predominantly single-stranded nature of the 
Figure I. (a) Non-cumulative and (b) cumulative GC-skew diagrams of the
SV40 virus. (c) Cumulative GC skew of the human papillomavirus HPV-1A. For 
both viruses, the replication origin coordinate corresponds to 0% (or 100% 
because the genomes are circular). Reproduced with permission from Ref. [1]. 


plus-gRNA during replicase translation or minus-gRNA 
synthesis. 

The skew is less pronounced on the short arm (although 
changes in the slope of the curve are sometimes small in 
Figure 1b-f, they are all significant; data not shown) and,
remarkably, the cumulative diagram even reverses its 
trend in SARS-CoV (Figure 1a). Most probably, this 
reflects higher rates of cytosine deamination on the 
minus-strand related to subgenomic mRNA synthesis. 
Figure 1. Cumulative GC skew diagrams of coronaviruses. RNA genomes of six representatives of the Coronaviridae family are shown: (a) severe acute respiratory
syndrome coronavirus (SARS-CoV) [4,5], (b) enteric bovine coronavirus (BCoV) [13], (c) murine hepatitis virus (MHV) [14], (d) porcine epidemic diarrhea virus (PEDV) [17],
(e) human coronavirus (229E) [18] and (f) avian infectious bronchitis virus (IBV) [19]. Diagrams with the window size of 60 bp were constructed as previously described
[1,7]. Vertical bars mark the end of the replicase polyprotein gene in these genomes. Note the different slopes of the curves to the left and to the right of
these vertical bars (which correspond to the division points between the long and short arms) and the differences in vertical scales on different panels.


The intracellular duplex of minus-gRNA with plus-gRNA 
protects them from cytosine deamination. If the first stage 
of transcription, which involves subgenomic mRNA 
template synthesis from the plus-gRNA, leaves minus- 
gRNA on the short arm as a single strand (Box 2, Figure I), 
then cytosine deamination will lead to the accumulation of 
uracils on minus-gRNA. Subsequently, synthesis of the 
new viral plus-gRNA from minus-gRNA will propagate 
these mutations, depleting guanines and decreasing the 
overall GC skew on the short arm of the plus strand. This 
explanation concurs with the model of subgenomic mRNA 
synthesis from minus-strand subgenomic RNA templates 
[9,10], for which there is experimental evidence in 
arteriviruses [11] and murine hepatitis virus (MHV) [12].
The rate of cytosine deamination that is related to
subgenomic mRNA (sgmRNA) synthesis is likely to be proportional to the
difference between the slopes of the curves in the long and 
short arms (Table 1). 
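
The chain of events described above (deamination on the single-stranded minus strand, then copying back to the plus strand) can be traced with a small deterministic sketch. For simplicity the complementary strand is kept position-aligned with its template rather than reversed; the toy sequence, the chosen deamination sites and the function names are all hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complement_strand(rna: str) -> str:
    """Complementary RNA strand, kept position-aligned with the template
    (not reversed) so that sites can be compared index by index."""
    return "".join(COMPLEMENT[base] for base in rna)


def deaminate(rna: str, positions: list) -> str:
    """Convert cytosine to uracil at the given positions; other bases
    at those positions are left unchanged."""
    bases = list(rna)
    for i in positions:
        if bases[i] == "C":
            bases[i] = "U"
    return "".join(bases)


def g_minus_c(rna: str) -> int:
    """Excess of guanines over cytosines on one strand."""
    return rna.count("G") - rna.count("C")


plus = "GGCAUGGCAU"                      # toy short-arm plus strand
minus = complement_strand(plus)          # minus-strand copy
damaged_minus = deaminate(minus, [0, 5]) # C -> U while single stranded
new_plus = complement_strand(damaged_minus)  # next-generation plus strand

print(plus, g_minus_c(plus))
print(new_plus, g_minus_c(new_plus))
```

In this toy case each C to U change on the minus strand appears as a G to A change on the new plus strand, so the guanine excess on the plus strand drops.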

Such a combination of mutational pressures for the two 
RNA strands indicates a higher overall substitution rate 
for the short arm, compared with the long arm. The 
supporting evidence for this comes from the comparison of 
two bovine coronaviruses (respiratory and enteric) that 
have differences in 107 nucleotide positions [13]. More 
than 80% of these differences correspond to the third base 
of a codon, indicating mutational pressure. I analyzed the 
distribution of these 107 positions and found that 59 of 
them localized to the short arm, suggesting an ~2.5-fold
increase in polymorphism density on that arm (the short arm
covers only about one third of the genome). Most of these
polymorphisms (85, ~80%) correspond to a C to U 
Box 2. Coronavirus replication and transcription in SARS-CoV
The genome of the severe acute respiratory syndrome coronavirus
(SARS-CoV) is a plus-strand genome RNA (plus-gRNA) of ~30 kb in
length. Translation of the replicase polyprotein on the long arm of the
genome is followed by minus-gRNA synthesis and transcription from
the short arm. The long and short arms of the SARS-CoV genome are
shown, together with the transcriptional products [eight subgenomic
mRNAs (sgmRNAs)], in Figure Ia [16].

Transcription on the short arm produces a nested set of 3'-coterminal
sgmRNAs, containing at their 5'-end a short leader sequence derived
from the 5'-end of the genome. A process for one of the subgenomic
mRNAs is shown in Figure Ib [10-12]. After a minus-strand sgmRNA is
synthesized on the short arm, a template switch enables the completion
of the synthesis of the leader sequence (shown as open box on the left-
hand side), skipping the long arm.

The relative levels of transcription and replication in coronaviruses 
mean that subgenomic mRNAs are by far the most abundant 
coronavirus RNAs in the cell, whereas the genome-length negative 
strand RNA (minus-gRNA) is the least abundant (it is ~ 10% of the level 
of plus-gRNA) [24]. These levels and localization of transcriptional 
activity are likely to be linked to the difference in mutation rates on the 
short and long arms (see main text). 
Figure I. (a) Genomic organization and (b) transcription process in the coronavirus genome. Different colors designate different types of RNA strands (i.e. coding and
template strands). Plus-strand genome RNA (gRNA) and minus-gRNA are shown in black and red, respectively. Open box indicates the leader sequence (not drawn to
scale).


substitution on one of the strands, further emphasizing 
the role of cytosine deamination as the primary mutational 
force in coronaviruses. 
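
The arithmetic behind the bovine coronavirus comparison can be checked with a short sketch. It assumes that the short arm covers one third of the genome (the long arm is described as two thirds) and compares polymorphism densities per unit length; the function name is hypothetical.

```python
def fold_enrichment(hits_short: int, hits_total: int, short_fraction: float) -> float:
    """Ratio of polymorphism density on the short arm to that on the
    long arm, given the fraction of the genome the short arm occupies."""
    hits_long = hits_total - hits_short
    density_short = hits_short / short_fraction
    density_long = hits_long / (1.0 - short_fraction)
    return density_short / density_long


# 59 of the 107 differing positions lie on the short arm [13].
fold = fold_enrichment(hits_short=59, hits_total=107, short_fraction=1 / 3)

# 85 of the 107 differences are C to U substitutions on one strand.
c_to_u_percent = 100.0 * 85 / 107

print(round(fold, 2), round(c_to_u_percent, 1))
```

The ratio comes out close to 2.5, and the C to U share close to 80%, in line with the figures quoted in the text.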

The rates of cytosine deamination in the SARS-CoV 
genome appear lower compared with other coronaviruses 
and this might explain the observation that the two 
sequenced strains diverged in genomic sequence by 
<0.003% [4,5]. Alternatively, if the epidemic came from 
a single clone, then only a short time span separates the 
two strains and that might explain the low divergence. 
Furthermore, the differences might be sequencing errors
or PCR artifacts. However, it is worth pointing out that
seven of the eight polymorphisms between the two strains also
correspond to a C to U substitution on one of the strands.

Comparison of the skew diagrams sets SARS-CoV apart
from other groups of coronaviruses but does not provide
any evidence of recent genomic recombination between 
members of those groups as the origin of SARS-CoV (such 
an event would have produced a skew diagram with 
fragments corresponding to the parent genomes). These 
observations are in agreement with the phylogenetic 
analyses of coronavirus-encoded proteins [4,5], which 
have also indicated lower conservation of the structural 
proteins, compared with replicase. This pattern appears to
result from the mutational biases described above
together with stronger selection on the replicase
proteins and might influence the virulence and host-cell
tropism of coronaviruses; examples of altered
pathogenesis have been reported for murine coronavirus
mutants [14].

Why are the mutational trends in the SARS-CoV 
genome so different from other coronaviruses? The cause 
is probably not in the host because another human 
coronavirus (229E) does not appear different from the 
other viruses examined (Figure 1e). Could the parameters
of the virus-encoded RNA synthesis machinery, such as the 
speed of replication or transcription, or their relative 
turnover be responsible for this difference? The level of 
cytosine deamination, reflected in GC skew, has been 
hypothesized to depend on the time a DNA strand spends 
in a single-stranded state [1,7,8] (Box 1), and the same is 
probably true for RNA. Although the relative contribution 
of transcription in SARS-CoV is similar to that in other
coronaviruses (Table 1, column L-S), the effect of replication
is much lower (Table 1, column L). This suggests that
either minus-strand synthesis is faster or plus-strand
replication is slower in SARS-CoV, or their relative turnover is
lower compared with the synthesis of subgenomic mRNA
template RNA.

All these findings are relevant for sequence-based 
diagnostics and drug design against SARS-CoV and 
other coronaviruses because targeting the long arm with 
lower mutation rates should prove more robust against 
mutational changes in the target. This lends further 
support to a recent suggestion to design anti-SARS 
drugs based on the structure of the SARS 3C-like 
proteinase [15], which is encoded by genes on the long 
arm. These anti-SARS drugs will function as protease 
inhibitors that might block coronavirus replication. 
Another set of putative targets has been suggested in a
recent publication [16] that has identified distant homologs
of cellular RNA processing enzymes in the SARS-CoV
genome. Notably, these are also encoded on the long arm as
parts of the replicase polyprotein.